76 research outputs found

ANMM4CBR: a case-based reasoning method for gene expression data classification

Author: A Aamodt
Bangpeng Yao
C Ding
D Berrar
F Díaz
H Li
I Jurisica
J Khan
J Kolodner
J Ye
JY Koo
K Fukunaga
M Bressan
M Dettling
MB Eisen
N Arshadi
OG Troyanskaya
OG Troyanskaya
PJ Park
R Bouckaert
RA Heller
S Dudoit
S Ramaswamy
SC Johnson
Shao Li
TR Golub
TS Furey
U Alon
W Pan
Y Freund
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Accurate classification of microarray data is critical for successful clinical diagnosis and treatment. The "curse of dimensionality" problem and noise in the data, however, undermine the performance of many algorithms. Method: To obtain a robust classifier, a novel Additive Nonparametric Margin Maximum for Case-Based Reasoning (ANMM4CBR) method is proposed in this article. ANMM4CBR employs a case-based reasoning (CBR) method for classification. CBR is a suitable paradigm for microarray analysis, where the rules that define the domain knowledge are difficult to obtain because usually only a small number of training samples are available. Moreover, to select the most informative genes, we propose to perform feature selection by additively optimizing a nonparametric margin maximum criterion, which is defined based on gene pre-selection and sample clustering. Our feature selection method is very robust to noise in the data. Results: The effectiveness of our method is demonstrated on both simulated and real data sets. We show that the ANMM4CBR method performs better than some state-of-the-art methods such as support vector machine (SVM) and k-nearest neighbor (kNN), especially when the data contain a high level of noise. Availability: The source code is attached as an additional file of this paper.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Indirect two-sided relative ranking: a robust similarity measure for gene expression data

Author: CM Perou
DE Arking
DE Martin
E Chávez
E Hubbell
ER DeLong
G Natsoulis
G Wei
GJ Kaspers
GJ Kaspers
IM Chakravarti
J Lamb
J Lamb
J Lu
JL DeRisi
KP Seiler
Lise Getoor
LJ van't Veer
Louis Licamele
OG Troyanskaya
R Pieters
SL Pomeroy
T Hongo
TR Golub
W Liu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: A large amount of gene expression data exists in the public domain. This data has been generated under a variety of experimental conditions. Unfortunately, these experimental variations have generally prevented researchers from accurately comparing and combining this wealth of data, which still hides many novel insights. Results: In this paper we present a new method, which we refer to as indirect two-sided relative ranking, for comparing gene expression profiles that is robust to variations in experimental conditions. This method extends the current best approach, which is based on comparing the correlations of the up- and down-regulated genes, by introducing a comparison based on the correlations in rankings across the entire database. Because our method is robust to experimental variations, it allows a greater variety of gene expression data to be combined, which, as we show, leads to richer scientific discoveries. Conclusions: We demonstrate the benefit of our proposed indirect method on several datasets. We first evaluate the ability of the indirect method to retrieve compounds with similar therapeutic effects across known experimental barriers, namely vehicle and batch effects, on two independent datasets (one private and one public). We show that our indirect method significantly improves upon the previous state-of-the-art method, with improvements in recall at rank 10 of 97.03% and 49.44% on the two datasets, respectively. Next, we demonstrate that our indirect method results in improved accuracy for classification in several additional datasets. These datasets demonstrate the use of our indirect method for classifying cancer subtypes, predicting drug sensitivity/resistance, and classifying (related) cell types. Even in the absence of a known (i.e., labeled) experimental barrier, the improvement of the indirect method in each of these datasets is statistically significant.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

Semi-supervised discovery of differential genes

Author: A Lewin
B Efron
C Furlanello
CE Bonferroni
D Singh
E Bair
E Wit
J Neyman
J Storey
J Storey
J Weston
JD Storey
JT Leek
K Najarian
KB Duan
M Bhattacharjee
M Seeger
N Dean
OG Troyanskaya
P Broberg
R Gottardo
R Tibshirani
Shigeyuki Oba
Shin Ishii
TR Golub
U Alon
VG Tusher
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Various statistical scores have been proposed for evaluating the significance of genes that may exhibit differential expression between two or more controlled conditions. However, in many clinical studies, for example those that aim to detect clinical marker genes, the conditions have not necessarily been controlled well, and condition labels are sometimes hard to obtain due to physical, financial, and time costs. In such a situation, we can consider an unsupervised case where labels are not available or a semi-supervised case where labels are available for a part of the whole sample set, rather than the well-studied supervised case where all samples have their labels. RESULTS: We assume a latent variable model for the expression of active genes and apply the optimal discovery procedure (ODP) proposed by Storey (2005) to the model. Our latent variable model allows gene significance scores to be applied to unsupervised and semi-supervised cases. The ODP framework improves detectability by sharing the estimated parameters of null and alternative models of multiple tests over multiple genes. A theoretical consideration leads to two different interpretations of the latent variable: either it only implicitly affects the alternative model through the model parameters, or it is explicitly included in the alternative model. These interpretations correspond to two different implementations of ODP. By comparing the two implementations through experiments with simulation data, we have found that sharing the latent variable estimation is effective for increasing the detectability of truly active genes. We also show that the unsupervised and semi-supervised rating of genes, which takes into account the samples without condition labels, can improve detection of active genes in real gene discovery problems.
CONCLUSION: The experimental results indicate that the ODP framework is effective for hypotheses including latent variables and is further improved by sharing the estimations of hidden variables over multiple tests.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Nonparametric relevance-shifted multiple testing procedures for the analysis of high-dimensional multivariate data with small sample sizes

Author: AI Fleishman
C Frömke
C Li
Cornelia Frömke
D Hauschke
DC Polacek
DJ Schaid
E Witt
J Khan
JF Chich
L Guo
LA Hothorn
Ludwig A Hothorn
N Zimmermann
NF Cariello
OG Troyanskaya
PH Westfall
PH Westfall
S Dudoit
S Dudoit
S Holm
S Kropf
S Kropf
S Lange
Siegfried Kropf
T Speed
VR Iyer
Y Benjamini
Y Ge
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: In many research areas it is necessary to find differences between treatment groups with several variables. For example, studies of microarray data seek to find a significant difference in location parameters from zero, or from one for ratios thereof, for each variable. However, in some studies a significant deviation of the difference in locations from zero (or 1 in terms of the ratio) is biologically meaningless. A relevant difference or ratio is sought in such cases. Results: This article addresses the use of relevance-shifted tests on ratios for a multivariate parallel two-sample group design. Two empirical procedures are proposed which embed the relevance-shifted test on ratios. As both procedures test a hypothesis for each variable, the resulting multiple testing problem has to be considered. Hence, the procedures include a multiplicity correction. Both procedures are extensions of available procedures for point null hypotheses achieving exact control of the familywise error rate. Whereas the shift of the null hypothesis alone would give straightforward solutions, the problems that are the reason for the empirical considerations discussed here arise from the fact that the shift is considered in both directions and the whole parameter space in between these two limits has to be accepted as the null hypothesis. Conclusion: The first procedure uses a permutation algorithm and is appropriate for designs with a moderately large number of observations. However, many experiments have limited sample sizes. In that case the second procedure might be more appropriate, where multiplicity is corrected according to a concept of data-driven order of hypotheses.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Institutionelles Repositorium der Leibniz Universität Hannover

Server für wissenschaftliche Schriften der Hochschule Hannover

Directing Experimental Biology: A Case Study in Mitochondrial Biogenesis

Author: A Goffeau
A Jaimovich
A Sickmann
AB Owen
AH Tong
Amy A. Caudy
Andrey Rzhetsky
AV Kochetov
BJ Blencowe
Burke
C Andreoli
C Huttenhower
Chad L. Myers
CL Myers
CL Myers
CL Myers
Curtis Huttenhower
David C. Hess
DC Hess
E Nabieva
F Foury
F Perocchi
G Giaever
GR Lanckriet
H Kitano
H Koutnikova
H Prokisch
I Boldogh
I Lee
IR Boldogh
JB Moseley
JM Cherry
Kai Li
L Peña-Castillo
LM Steinmetz
M Ashburner
M Babcock
M Grunstein
M Ogur
M Ogur
MA Hibbs
Matthew A. Hibbs
OG Troyanskaya
Olga G. Troyanskaya
P Pavlidis
R Jansen
S DiMauro
TR Hughes
Z Barutcuoglu
Publication venue: Public Library of Science
Publication date: 01/01/2009

Computational approaches have promised to organize collections of functional genomics data into testable predictions of gene and protein involvement in biological processes and pathways. However, few such predictions have been experimentally validated on a large scale, leaving many bioinformatic methods unproven and underutilized in the biology community. Further, it remains unclear what biological concerns should be taken into account when using computational methods to drive real-world experimental efforts. To investigate these concerns and to establish the utility of computational predictions of gene function, we experimentally tested hundreds of predictions generated from an ensemble of three complementary methods for the process of mitochondrial organization and biogenesis in Saccharomyces cerevisiae. The biological data with respect to the mitochondria are presented in a companion manuscript published in PLoS Genetics (doi:10.1371/journal.pgen.1000407). Here we analyze and explore the results of this study that are broadly applicable for computationalists applying gene function prediction techniques, including a new experimental comparison with 48 genes representing the genomic background. Our study leads to several conclusions that are important to consider when driving laboratory investigations using computational prediction approaches. While most genes in yeast are already known to participate in at least one biological process, we confirm that genes with known functions can still be strong candidates for annotation of additional gene functions. We find that different analysis techniques and different underlying data can both greatly affect the types of functional predictions produced by computational methods. This diversity allows an ensemble of techniques to substantially broaden the biological scope and breadth of predictions. 
We also find that performing prediction and validation steps iteratively allows us to more completely characterize a biological area of interest. While this study focused on a specific functional area in yeast, many of these observations may be useful in the contexts of other processes and organisms.

Public Library of Science (PLOS)

Crossref

The Jackson Laboratory: The Mouseion at the JAXlibrary

Directory of Open Access Journals

PubMed Central

Principal components analysis based methodology to identify differentially expressed genes in time-course microarray data

Author: A Conesa
A Reverter
AA Alizadeh
BM Wise
C Cheng
C Koch
DK Slonim
DR McMillan
G Zhu
H Sakoe
I Simon
JD Storey
JE Jackson
K Nasmyth
M Koranda
MB Eisen
MR Fielden
MS Bartlett
ND Trinklein
NJH Small
NS Holter
O Alter
OG Troyanskaya
PT Spellman
R Tabibiazar
Rajagopalan Srinivasan
S Raychaudhuri
SE Calvano
Sudhakar Jonnalagadda
T Park
V Vinciotti
W Pan
Z Bar-Joseph
Z Bar-Joseph
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Time-course microarray experiments are increasingly used to characterize dynamic biological processes. In these experiments, the goal is to identify genes differentially expressed in time-course data, measured between different biological conditions. These differentially expressed genes can reveal the changes in biological process due to the change in condition, which is essential for understanding differences in dynamics. Results: In this paper, we propose a novel method for finding differentially expressed genes in time-course data and across biological conditions (say C1 and C2). We model the expression at C1 using Principal Component Analysis and represent the expression profile of each gene as a linear combination of the dominant Principal Components (PCs). Then the expression data from C2 is projected onto the developed PCA model and scores are extracted. The difference between the scores is evaluated using a hypothesis test to quantify the significance of differential expression. We evaluate the proposed method in two case studies: (1) the heat shock response of wild-type and HSF1 knockout mice, and (2) the cell cycle in wild-type and Fkh1/Fkh2 knockout yeast strains. Conclusion: In both cases, the proposed method identified biologically significant genes.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

ScholarBank@NUS

Functional Genomics Complements Quantitative Genetics in Identifying Disease-Gene Associations

Author: A Lazary
AI Su
AS Siddiqui
B Linghu
BJ Breitkreutz
Braden Kell
Brem
C Alfarano
C Huttenhower
Cheryl L. Ackert-Bicknell
CJ Bult
CL Ackert-Bicknell
CL Ackert-Bicknell
CL Myers
CL Smith
CT Workman
D Jesse
DR Rhodes
F Rivadeneira
I Lee
I Lee
I Lee
JE Wergedal
JH Moore
JT Eppig
K Hatori
K Xia
KL McGary
KP O'Brien
KR Brown
L Salwinski
LU Gerdes
M Ashburner
Matthew A. Hibbs
MD Chikina
MI McCarthy
MS Huang
Nathan D. Price
O Carlborg
OG Troyanskaya
Olga G. Troyanskaya
PT Tarr
S Durinck
S Varghese
TA Manolio
TF Mackay
VN Vapnik
W Cookson
W Zhang
WJ Fu
WY Wang
Y Guan
Y Guan
Y Taes
Yuanfang Guan
Z Hu
Publication venue: Public Library of Science
Publication date: 01/01/2010

An ultimate goal of genetic research is to understand the connection between genotype and phenotype in order to improve the diagnosis and treatment of diseases. The quantitative genetics field has developed a suite of statistical methods to associate genetic loci with diseases and phenotypes, including quantitative trait loci (QTL) linkage mapping and genome-wide association studies (GWAS). However, each of these approaches has technical and biological shortcomings. For example, the amount of heritable variation explained by GWAS is often surprisingly small and the resolution of many QTL linkage mapping studies is poor. The predictive power and interpretation of QTL and GWAS results are consequently limited. In this study, we propose a complementary approach to quantitative genetics by interrogating the vast amount of high-throughput genomic data in model organisms to functionally associate genes with phenotypes and diseases. Our algorithm combines the genome-wide functional relationship network for the laboratory mouse and a state-of-the-art machine learning method. We demonstrate the superior accuracy of this algorithm through predicting genes associated with each of 1157 diverse phenotype ontology terms. Comparison between our prediction results and a meta-analysis of quantitative genetic studies reveals both overlapping candidates and distinct, accurate predictions uniquely identified by our approach. Focusing on bone mineral density (BMD), a phenotype related to osteoporotic fracture, we experimentally validated two of our novel predictions (not observed in any previous GWAS/QTL studies) and found significant bone density defects for both Timp2- and Abcg8-deficient mice. Our results suggest that the integration of functional genomics data into networks, which itself is informative of protein function and interactions, can successfully be used as a complementary approach to quantitative genetics to predict disease risks.
All supplementary material is available at http://cbfg.jax.org/phenotype

CiteSeerX

Public Library of Science (PLOS)

Crossref

The Jackson Laboratory: The Mouseion at the JAXlibrary

Directory of Open Access Journals

PubMed Central

How repetitive are genomes?

Author: A Faiella
AE Mirsky
B Haubold
Bernhard Haubold
CA Thomas Jn
D Gusfield
D Tautz
EA Bennett
EPC Rocha
G Achaz
International Human Genome Sequencing Consortium
J Liu
JI Jordan
JM Hancock
JM Hancock
L Zhou
LE Orgel
M Hofnung
MA Nóbrega
Mouse Genome Sequencing Consortium
N Volfovsky
OG Troyanskaya
R Development Core Team
RA Aras
Rat Genome Sequencing Consortium
RJ Britten
S Kurtz
SS Shapiro
The Chimpanzee Sequencing and Analysis Consortium
Thomas Wiehe
TR Gregory
WF Doolittle
Y Tian
YL Orlov
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Genome sequences vary strongly in their repetitiveness and the causes for this are still debated. Here we propose a novel measure of genome repetitiveness, the index of repetitiveness, I(r), which can be computed in time proportional to the length of the sequences analyzed. We apply it to 336 genomes from all three domains of life. RESULTS: The expected value of I(r) is zero for random sequences of any G/C content and greater than zero for sequences with excess repeats. We find that the I(r) of archaea is significantly smaller than that of eubacteria, which in turn is smaller than that of eukaryotes. Mouse chromosomes have a significantly higher I(r) than human chromosomes and within each genome the Y chromosome is most repetitive. A sliding window analysis reveals that the human HOXA cluster and two surrounding genes are characterized by local minima in I(r). A program for calculating I(r) is freely available at . CONCLUSION: The general measure of DNA repetitiveness proposed in this paper can be efficiently computed on a genomic scale. This reveals a broad spectrum of repetitiveness among diverse genomes which agrees qualitatively with previous studies of repeat content. A sliding window analysis helps to analyze the intragenomic distribution of repeats.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MPG.PuRe

A Genomewide Functional Network for the Laboratory Mouse

Author: A Abeliovich
AI Su
Andrey Rzhetsky
AS Siddiqui
B Fahrenkrog
B Lehner
B Snel
BJ Breitkreutz
C Alfarano
C Oka
Carol J. Bult
Chad L. Myers
CL Myers
CL Myers
DJ Watts
DP Hill
DR Rhodes
EE Schadt
EI Boyle
F Tong
H Jeong
I Chambers
I Lee
I Lee
Ihor R. Lemischka
JM Stuart
JT Eppig
JT Eppig
K Mitsui
K Xia
KE Bernstein
KI Goh
KP O'Brien
KR Brown
L Franke
L Peña-Castillo
L Salwinski
LA Boyer
M Ashburner
M Kanehisa
N Ivanova
N Novershtern
OG Troyanskaya
Olga G. Troyanskaya
P Tsaparas
R Jansen
R Setsuie
R Sharan
RA Fisher
Rong Lu
S Bandyopadhyay
S Coulomb
S Durinck
T Harata
T Jiang
TK Gandhi
U Ala
W Zhang
W Zhong
Y Chen
Y Qi
YH Loh
Yuanfang Guan
Publication venue: Public Library of Science
Publication date: 01/01/2008

Establishing a functional network is invaluable to our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. In this study, we present a functional network for the laboratory mouse based on a Bayesian integration of diverse genetic and functional genomic data. The resulting network includes probabilistic functional linkages among 20,581 protein-coding genes. We show that this network can accurately predict novel functional assignments and network components and present experimental evidence for predictions related to Nanog homeobox (Nanog), a critical gene in mouse embryonic stem cell pluripotency. An analysis of the global topology of the mouse functional network reveals multiple biologically relevant systems-level features of the mouse proteome. Specifically, we identify the clustering coefficient as a critical characteristic of central modulators that affect diverse pathways as well as genes associated with different phenotype traits and diseases. In addition, a cross-species comparison of functional interactomes on a genomic scale revealed distinct functional characteristics of conserved neighborhoods as compared to subnetworks specific to higher organisms. Thus, our global functional network for the laboratory mouse provides the community with a key resource for discovering protein functions and novel pathway components as well as a tool for exploring systems-level topological and evolutionary features of cellular interactomes. To facilitate exploration of this network by the biomedical research community, we illustrate its application in function and disease gene discovery through an interactive, Web-based, publicly available interface at http://mouseNET.princeton.edu

Public Library of Science (PLOS)

Crossref

The Jackson Laboratory: The Mouseion at the JAXlibrary

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

Assembly complexity of prokaryotic genomes using short reads

Author: A Guénoche
AR Rubinov
B Bollobás
B Haubold
C Smith
Carl Kingsford
D Gusfield
DH Huson
DR Zerbino
D van den Broek
E Myers
EW Myers
I Simon
J Butler
J Parkhill
JAA Quitzau
JC Dohm
JP Hutchinson
JP Hutchinson
M Antoniotti
M Margulies
Michael C Schatz
Mihai Pop
MJ Chaisson
MJ Chaisson
MS Waterman
N de Bruijn
N Whiteford
OG Troyanskaya
P Medvedev
PA Pevzner
PA Pevzner
R Barrangou
R Idury
S Batzoglou
T van Aardenne-Ehrenfest
TD Harris
WR Jeck
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results: We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions: Our results improve upon previous studies on the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.
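
For readers unfamiliar with the data structure named in this abstract, the following is a minimal sketch of a de Bruijn graph and of how a repeat shows up in it as a branching node. It is an illustration only, not the authors' analysis pipeline: the function names `de_bruijn_graph` and `branching_nodes` and the toy sequence are assumptions made for this example, and the graph is built from a known sequence rather than from sequencing reads.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Edge from the k-mer's prefix to its suffix (both of length k-1).
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Return nodes with more than one distinct successor.

    Such nodes mark places where a repeated (k-1)-mer is followed by
    different sequence, so a path through the graph is ambiguous there.
    """
    return sorted(node for node, succ in graph.items() if len(set(succ)) > 1)


# Hypothetical toy sequence, chosen so that "ACG" occurs twice
# and is followed by a different base each time.
toy_genome = "ACGTACGA"
graph = de_bruijn_graph(toy_genome, k=4)
ambiguous = branching_nodes(graph)
```

In this sketch a larger `k` (loosely analogous to a longer read) makes fewer (k-1)-mers repeat, so fewer nodes branch; this is the general intuition behind relating read length to how much of a genome can be reconstructed unambiguously.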

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland